State Space Reduction For Hierarchical Reinforcement Learning
Abstract
This paper provides new techniques for abstracting the state space of a Markov Decision Process (MDP). These techniques extend one of the recent minimization models, known as ε-reduction, to construct a partition space that has fewer states than the original MDP. As a result, learning policies on the partition space should be faster than on the original state space. The technique presented here extends ε-reduction to SMDPs by executing a policy instead of a single action and grouping all states that differ only slightly in transition probabilities and reward function under a given policy. When the reward structure is not known, a two-phase method for state aggregation is introduced, and a theorem in this paper shows the solvability of tasks using the two-phase method partitions. These partitions can be further refined when the complete structure of the reward is available. Simulations of different state spaces show that the policies in both the MDP and this representation achieve similar results, and that the total learning time in the partition space with the presented approach is much smaller than the total time spent learning on the original state space.
Introduction
Markov decision processes (MDPs) are a useful way to model stochastic environments, as there are well-established algorithms to solve these models. Even though these algorithms find an optimal solution for the model, they suffer from high time complexity when the number of decision points is large (Parr 1998; Dietterich 2000). To address increasingly complex problems, a number of approaches have been used to design state space representations that increase the efficiency of learning (Dean Thomas; Kaelbling & Nicholson 1995; Dean & Robert 1997). Here, particular features are hand-designed based on the task domain and the capabilities of the learning agent.
In autonomous systems, however, this is generally a difficult task, since it is hard to anticipate which parts of the underlying physical state are important for the given decision-making problem. Moreover, in hierarchical learning approaches the required information might change over time as increasingly competent actions become available. The same can be observed in biological systems, where information about all muscle fibers is initially instrumental in generating strategies for coordinated movement. However, as such strategies become established and ready to be used, this low-level information no longer has to be consciously taken into account.
Copyright © 2004, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
The methods presented here build on the ε-reduction technique developed by Dean et al. (Givan & Thomas 1995) to derive representations in the form of state space partitions that ensure that the utility of a policy learned in the reduced state space is within a fixed bound of the optimal policy. The methods presented here extend the ε-reduction technique by including policies as actions, and thus use it to find approximate SMDP reductions. Furthermore, they derive partitions for individual actions and compose them into representations for any given subset of the action space. This is further extended by permitting the definition of a two-phase partitioning that is initially reward independent and can later be refined once the reward function is known.
In particular, the techniques described in the following subsections extend ε-reduction (Thomas Dean & Leach 1997) by introducing the following methods:
• Temporal abstraction
• Action dependent decomposition
• Two-phase decomposition
Formalism
A Markov decision process (MDP) is a 4-tuple (S, A, P, R), where S is the set of states, A is a set of actions available in each state, P is a transition probability function that assigns a value 0 ≤ p ≤ 1 to each successor state of a state-action pair, and R is the reward function. A transition function is a map P : S × A × S → [0, 1], usually denoted by P(s′|s, a), which is the probability that executing action a in state s will lead to state s′. Similarly, a reward function is a map R : S × A → ℝ, and R(s, a) denotes the reward gained by executing action a in state s. Any policy π defines a value function V^π, and the Bellman equation (Bellman 1957; Puterman 1994) connects the value of each state to the values of the other states, with γ denoting the discount factor:
$$V^{\pi}(s) = R(s, \pi(s)) + \gamma \sum_{s'} P(s' \mid s, \pi(s)) \, V^{\pi}(s')$$
Previous Work
State space reduction methods use the basic concepts of an MDP, such as transition probabilities and the reward function, to represent a large class of states with a single state of the abstract space. Two conditions determine whether the generated abstraction is a valid approximate MDP:
1. The differences between the transition functions and between the reward functions of the two models must be small.
2. For each policy on the original state space there must exist a policy in the abstract model. Moreover, if a state s′ is not reachable from state s in the abstract model, then there should not exist a policy that leads from s to s′ in the original state space.
SMDPs
One approach to temporal abstraction is to use the theory of semi-Markov decision processes (SMDPs). The actions in SMDPs take a variable amount of time and are intended to model temporally extended actions, represented as a sequence of primary actions.
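To make the Bellman equation from the Formalism section concrete, the following minimal sketch evaluates a fixed policy by applying that equation repeatedly until the values stop changing. It is an illustration under stated assumptions, not the paper's implementation: the two-state MDP, the state and action names, the dictionary layout of P and R, and the helper `evaluate_policy` are all hypothetical.

```python
# Iterative policy evaluation on a tiny, hypothetical MDP.
# Assumptions (not from the paper): P[(s, a)] maps successor states to
# probabilities, R[(s, a)] is a scalar reward, and the policy is deterministic.

def evaluate_policy(states, policy, P, R, gamma=0.9, tol=1e-10, max_iters=10_000):
    """Approximate V^pi by sweeping the Bellman equation over all states."""
    V = {s: 0.0 for s in states}
    for _ in range(max_iters):
        delta = 0.0
        for s in states:
            a = policy[s]
            # V(s) = R(s, pi(s)) + gamma * sum_{s'} P(s'|s, pi(s)) * V(s')
            new_v = R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)].items())
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:  # stop once no state value changed noticeably
            break
    return V


# Hypothetical example: "s0" moves to "s1"; "s1" loops and earns reward 1.
states = ["s0", "s1"]
policy = {"s0": "go", "s1": "stay"}
P = {("s0", "go"): {"s1": 1.0}, ("s1", "stay"): {"s1": 1.0}}
R = {("s0", "go"): 0.0, ("s1", "stay"): 1.0}

V = evaluate_policy(states, policy, P, R, gamma=0.9)
print(V)
```

For this toy model the fixed point can be checked by hand: V(s1) = 1 / (1 − 0.9) = 10 and V(s0) = 0.9 · 10 = 9. Every sweep costs time proportional to the number of states, which is why a partition with fewer states is expected to speed up learning.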
Policies: A policy (option) in SMDPs is a triple o_i = (I_i, π_i, β_i) (Boutillier & Hanks 1995), where I_i is an initiation set, π_i : S × A → [0, 1] is a primary policy, and β_i : S → [0, 1] is a termination condition. When a policy o_i is executed, actions are chosen according to π_i until the policy terminates stochastically according to β_i. The initiation set and termination condition of a policy limit the range over which the policy needs to be defined and determine its termination. Given any set of multi-step actions, we consider the policy over those actions. In this case we need to generalize the definition of the value function. The value of a state s under an SMDP policy π is defined as (Boutillier & Goldszmidt 1994):
Similar resources
Hierarchical Functional Concepts for Knowledge Transfer among Reinforcement Learning Agents
This article introduces the notions of functional space and concept as a way of knowledge representation and abstraction for Reinforcement Learning agents. These definitions are used as a tool of knowledge transfer among agents. The agents are assumed to be heterogeneous; they have different state spaces but share a same dynamic, reward and action space. In other words, the agents are assumed t...
Hierarchical Reinforcement Learning in Computer Games
Hierarchical reinforcement learning is an increasingly popular research field. In hierarchical reinforcement learning the complete learning task is decomposed into smaller subtasks that are combined in a hierarchical network. The subtasks can then be learned independently. A hierarchical decomposition can potentially facilitate state abstractions (i.e., bring forth a reduction in state space co...
Literature Review
Reinforcement learning is an attractive method of machine learning. However, as the state space of a given problem increases, reinforcement learning becomes increasingly inefficient. Hierarchical reinforcement learning is one method of increasing the efficiency of reinforcement learning. It involves breaking the overall goal of a problem into a hierarchy subgoals, and then attempting to achieve...
State Abstraction in MAXQ Hierarchical Reinforcement Learning
Many researchers have explored methods for hierarchical reinforcement learning (RL) with temporal abstractions, in which abstract actions are defined that can perform many primitive actions before terminating. However, little is known about learning with state abstractions, in which aspects of the state space are ignored. In previous work, we developed the MAXQ method for hierarchical RL. In th...
Hierarchical reinforcement learning with subpolicies specializing for learned subgoals
This paper describes a method for hierarchical reinforcement learning in which high-level policies automatically discover subgoals, and low-level policies learn to specialize for different subgoals. Subgoals are represented as desired abstract observations which cluster raw input data. High-level value functions cover the state space at a coarse level; low-level value functions cover only parts...
Hierarchical Neuro-Fuzzy Systems Part II
This paper describes a new class of neuro-fuzzy models, called Reinforcement Learning Hierarchical NeuroFuzzy Systems (RL-HNF). These models employ the BSP (Binary Space Partitioning) and Politree partitioning of the input space [Chrysanthou,1992] and have been developed in order to bypass traditional drawbacks of neuro-fuzzy systems: the reduced number of allowed inputs and the poor capacity t...